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Abstract 

In this paper we propose a novel framework for the construction of sparsity-inducing pri- 
ors. In particular, we define such priors as a mixture of exponential power distributions 
with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized 
hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace 
scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework 
for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly 
expressed. Moreover, the corresponding posterior distribution also follows a generalized 
inverse Gaussian distribution. These properties lead us to EM algorithms for Bayesian 
sparse learning. We show that these algorithms bear an interesting resemblance to itera- 
tively re-weighted £2 or £\ methods. In addition, we present two extensions for grouped 
variable selection and logistic regression. 

Keywords: Sparsity priors, scale mixtures of exponential power distributions, general- 
ized inverse Gaussian distributions, expectation-maximization algorithms, iteratively re- 
weighted minimization methods. 



1. Introduction 

In this paper we are concerned with sparse supervised learning problems over a training 
dataset X = {(xj, 2/j)}f =1 . The point of departure for our work is the traditional formulation 
of supervised learning as a regularized optimization problem: 

mm |L(b;A')+P A (b)}, 

where b denotes the model parameter vector, L(-) the loss function penalizing data misfit, 
P\(-) the regularization term penalizing model complexity, and A > the tuning parameter 
balancing the relative significance of the loss function and the penalty. 
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Variable selection is a fundamental problem in high-dimensional learning problems, and 
is closely tied to the notion that the data-generating mechanism can be described using 
a sparse representation. In supervised learning scenarios, the problem is to obtain sparse 
estimates for the regression vector b. Given that it is NP-hard to use the £q penalty (i.e., 
the number of the nonzero elements of b) (Weston et al., 2003), attention has focused on use 
of the l\ penalty (Tibshirani, 1996). But in addition a number of studies have emphasized 
the advantages of nonconvex penalties — such as the bridge penalty and the log-penalty — for 
achieving sparsity (Fu, 1998, Fan and Li, 2001, Mazumder et al., 2011). 

The regularized optimization problem can be cast into a maximum a posteriori (MAP) 
framework. This is done by taking a Bayesian decision-theoretic approach in which the loss 
function L(h; X) is based on the conditional likelihood of the output t/i and the penalty 
P\(h) is associated with a prior distribution of b. For example, the least-squares loss 
function is associated with Gaussian likelihood, while there exists duality between the l\ 
penalty and the Laplace prior. 

The MAP framework provides us with Bayesian underpinnings for the sparse estimation 
problem. This has led to Bayesian versions of the lasso, which are based on expressing the 
Laplace prior as a scale-mixture of a Gaussian distribution and an exponential density (An- 
drews and Mallows, 1974, West, 1987). Figueiredo (2003), and Kiiveri (2008) presented a 
Bayesian lasso based on the Expectation-Maximization (EM) algorithm. Caron and Doucet 
(2008) then considered an EM estimation with normal-gamma or normal-inverse-gaussian 
priors. In one latest work, Poison and Scott (2011b) proposed using generalized hyperbolic 
distributions, variance-mean mixtures of Gaussians with generalized inverse Gaussian den- 
sities. Moreover, Poison and Scott (2011b) devised EM algorithms via data augmentation 
methodology. However, these treatments is not fully Bayesian. Lee et al. (2010) referred to 
such a method based on MAP estimation as a quasi-Bayesian approach. Additionally, an 
empirical-Bayes sparse learning was developed by Tipping (2001). 

Recently, Park and Casella (2008) and Hans (2009) proposed full Bayesian lasso models 
based on Gibbs sampling. Further work by Griffin and Brown (2010a) involved the use of 
a family of normal-gamma priors as a generalization of the Bayesian lasso. This prior has 
been also used by Archambeau and Bach (2009) to develop sparse probabilistic projections. 
In the work of Carvalho et al. (2010), the authors proposed horseshoe priors which are a 
mixture of normal distributions and a half-Cauchy density on the positive reals with scale 
parameter. Kyung et al. (2010) conducted performance analysis of Bayesian lassos in depth 
theoretically and empirically. 

There has also been work on nonconvex penalties within a Bayesian framework. Zou 
and Li (2008) derived their local linear approximation (LLA) algorithm by combining the 
EM algorithm with an inverse Laplace transformation. In particular, they showed that the 
bridge penalty can be obtained by mixing the Laplace distribution with a stable distribution. 
Other authors have shown that the prior induced from the log-penalty has an interpretation 
as a scale mixture of Laplace distributions with an inverse gamma density (Cevher, 2009, 
Garrigues and Olshausen, 2010, Lee et al., 2010, Armagan et al., 2011). Additionaly, Griffin 
and Brown (2010b) devised a family of normal-exponential-gamma priors for a Bayesian 
adaptive lasso (Zou, 2006). Poison and Scott (2010, 2011a) provided a unifying framework 
for the construction of sparsity priors using Levy processes. 
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In this paper we develop a novel framework for constructing sparsity-inducing priors. 
Generalized inverse Gaussian (GIG) distributions (Jorgensen, 1982) are conjugate with re- 
spect to an exponential power (EP) distribution (Box and Tiao, 1992) — an extension of 
Gaussian and Laplace distributions. Accordingly, we propose a family of distributions that 
we refer to as EP-GIG. In particular, we define EP-GIG as a scale mixture of EP distri- 
butions with a GIG density, and derive their explicit densities. EP-GIG can be regarded 
as a variant of generalized hyperbolic distributions, and include Gaussian scale mixtures 
and Laplacian scale mixtures as special cases. The Gaussian scale mixture is a class of 
generalized hyperbolic distributions (Poison and Scott, 2011b) and its special cases include 
normal-gamma distributions (Griffin and Brown, 2010a) as well as the Laplacian distribu- 
tion. The generalized double Pareto distribution in (Cevher, 2009, Armagan et al., 2011, 
Lee et al., 2010) and the bridge distribution inducing the £i/ 2 bridge penalty (Zou and Li, 
2008) are special cases of Laplacian scale mixtures. In Appendix B, we devise a set of now 
EP-GIG priors. 

Since GIG priors are conjugate with respect to EP distributions, it is feasible to apply 
EP-GIG to Bayesian sparse learning. Although it has been illustrated that fully Bayesian 
sparse learning methods based on Markov chain Monte Carlo sampling work well, our main 
attention is paid to a quasi-Bayesian approach. The important purpose is to explore the 
equivalent relationship between the MAP estimator and the classical regularized estimator 
in depth. In particular, using the property of EP-GIG as a scale-mixture of exponential 
power distributions, we devise EM algorithms for finding a sparse MAP estimate of b. 

When we set the exponential power distribution to be the Gaussian distribution, the 
resulting EM algorithm is closely related to the iteratively re-weighted £2 minimization 
methods in (Daubechies et al., 2010, Chartrand and Yin, 2008, Wipf and Nagarajan, 2010). 
When we employ the Laplace distribution as a special exponential power distribution, we 
obtain an EM algorithm which is identical to the iteratively re-weighted i\ minimization 
method in (Candes et al., 2008). Interestingly, using a bridge distribution of order q (0 < 
q < 1) results in an EM algorithm, which in turn corresponds to an iteratively re-weighted 
t q method. 

We also develop hierarchical Bayesian approaches for grouped variable selection (Yuan 
and Lin, 2007) and penalized logistic regression by using EP-GIG priors. We apply our 
proposed EP-GIG priors in Appendix B to conduct experimental analysis. The experimental 
results validate that those proposed EP-GIG priors which induce nonconvex penalties are 
potentially feasible and effective in sparsity modeling. Finally, we would like to highlight 
that our work offers several important theorems as follows. 

1. Theorems 5, 6 and 7 recover the relationship of EP-GIG with the corresponding EP at 
the limiting case. These theorems extend the fact that the t-distribution degenerates 
to Gaussian distribution as the degree of freedom approaches infinity. 

2. Theorem 8 proves that an exponential power distribution of order q/2 (q > 0) can be 
always represented a scale mixture of exponential power distributions of order q with 
a gamma mixing density. 
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3. The first part of Theorem 9 shows that GIG is conjugate with respect to EP, while 
the second part then offers a theoretical evidence of relating EM algorithms with 
iteratively re-weighted minimization methods under our framework. 

4. Theorem 10 shows that the negative log EP-GIG can induce a class of sparsity penal- 
ties. Especially, it shows a class of nonconvex penalties. Finally, Theorem 11 estab- 
lishes the oracle properties of the sparse estimator based on Laplace scale mixture 
priors. 

The paper is organized as follows. A brief reviewe about exponential power distributions 
and generalized inverse Gaussian distributions is given in Section 2. Section 3 presents 
EP-GIG distributions and their some properties, Section 4 develops our EM algorithm 
for Bayesian sparse learning, and Section 5 discusses the equivalent relationship between 
the EM and iteratively re-weighted minimization methods. In Section 6 we conduct our 
experimental evaluationsd. Finally, we conclude our work in Section 7, defer all proofs to 
Appendix A, and provide several new sparsity priors in Appendix B. 



2. Preliminaries 

Before presenting EP-GIG priors for sparse modeling of regression vector b, we give some 
notions such as the exponential power (EP) and generalized inverse Gaussian (GIG) distri- 
butions. 



2.1 Exponential Power Distributions 

For a univariate random variable b G R, it is said to follow an EP distribution if the density 
is specified by 

rj~ 1/q 1 q(2riyv 1 

P(b) = jE-^ ex P("2^ " = 2T(If 6XP( "^ 16 " Ul% 

with r] > 0. In the literature (Box and Tiao, 1992), it is typically assumed that q > 1. 
However, we find that it is able to relax this assumption into q > without any obstacle. 
Thus, we assume q > for our purpose. Moreover, we will set u = 0. 

The distribution is denoted by EP(b\u, r), q). There are two classical special cases: the 
Gaussian distribution arises when q = 2 (denoted N(b\u, rj)) and the Laplace distribution 
arises when q = 1 (denoted L(b\u, r))). As for the case that q < 1, the corresponding density 
induces a bridge penalty for b. We thus refer to it as the bridge distribution. 

2.2 Generalized Inverse Gaussian Distributions 

We first let G(t]\t, 9) denote the gamma distribution whose density is 

QT 

p(v) = tt^^' 1 ex P(~ 6 ' ? ?)' t,6>0, 
I (r) 

and \G(t}\t,9) denote the inverse gamma distribution whose density is 

P (V) = S-^ (l+T) exp(-^" 1 ), r, 9 > 0. 

1 T 
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An interesting property of the gamma and inverse gamma distributions is given as follows. 
Proposition 1 Let A > 0. Then 

(1) lim^ G( V \t,t\) = S(r,\ 1/A). 

(2) Iim T _, 0O IG(f7|r,r/A) = <5(f7|l/A). 

Here 5(r)\a) is the Dirac delta function; namely, 



5(i]\a) = 



oo if 7] = a, 
otherwise. 



We now consider the GIG distribution. The density of GIG is defined as 

p(r,) = { ^ V 'l ^- 1 exp(-(ar ? + /3r / - 1 )/2), r, > 0, (1) 

where K^{-) represents the modified Bessel function of the second kind with the index 7. 
We denote this distribution by GIG(ry|7, (3, a). It is well known that its special cases include 
the gamma distribution G(rj\j, a/2) when (3 = and 7 > 0, the inverse gamma distribution 
I G (77 1 — 7, j3/2) when a = and 7 < 0, the inverse Gaussian distribution when 7 = —1/2, 
and the hyperbolic distribution when 7 = 0. Please refer to J0rgensen (1982) for the details. 
Note in particular that the pdf of the inverse Gaussian GIG(ry| — 1/2, /3, a) is 

/3 \l/2 



P(V) = (^) exp(7a^)r ? -i exp(-(ar ? + /Sr/" 1 )^), (3 > 0, (2) 
and the pdf of G\G{rj\l/2, p, a) is 



p(v) = {^) exp( v / a/?)r ? -2 exp(-(ar ? + /fy" )/2), a > 0. (3) 

Note moreover that GIG(r?|-l/2, /3, 0) and GIG(r?|l/2, 0, a) degenerate to IG(r?|l/2, /3/2) and 
G(rf\l/2,a/2), respectively. 

As an extension of Proposition 1, we have the limiting property of GIG as follows. 

Proposition 2 Let a > and j3 > 0. Then 

(1) lim 7 ^ +00 GIG(r?|7,/?,7a) = 5{n\2/a). 

(2) lim^-ooGIG^^M = 5( V \p/2). 

We now present an alternative expression for the GIG density that is also interesting. 
Let ip = -faj3 and 4> = y/a//3. We can rewrite the density of G\G(r)\j, (3, a) as 

P(V) = ^^y 7 ^ 1 exp(-V(0r/ + (0r / )- 1 )/2), 77 > 0. (4) 

Proposition 3 Let p(?7) be defined by (4) where <fi is a positive constant and 7 is a any 
constant. Then 

lim p{rj) = 5(r]\(f>). 

Finally, let us consider that the case 7 = 0. Furthermore, letting tp — > 0, we can see that 
p{rf) oc 1/77, an improper prior. Note that this improper prior can regarded as the Jeffreys 
prior because the Fisher information of EP(b\0,rj) with respect to n is rj~ 2 /q. 



5 



3. EP-GIG Distributions 



We now develop a family of distributions by mixing the exponential power EP (b\0,n,q) 
with the generalized inverse Gaussian GIG(r/|7, (3, a). The marginal density of b is currently 
defined by 

r+oo 

p(b)= / EP(b\0, V ,q)G\G( V \ 7 ,P,a)d V . 
Jo 

We refer to this distribution as the EP-GIG and denote it by EGIG(6|q, j3, 7, q). 
Theorem 4 Let b ~ EGIG(&|oj, /3, 7, q). Then its density is 

Kj^Wa^+W)) 1/(2,) 

The following theorem establishes an important relationship of an EP-GIG with the 
corresponding EP distribution. It is an extension of the relationship of a i-distribution with 
the Gaussian distribution. That is, 

Theorem 5 Consider EP-GIG distributions. Then 

(1) lim 7 ^ +00 EGIG(5|7a,/3,7,g) = EP(6|0, 2/a, q); 

(2) lim 7 ^_ 00 EGIG(5|a,-7/3, 7 ,g) = EP(6|0, 0/2, q). 

(3) lim^_^ +00 EGIG(6|a, f3, 7, q) = EP(6|0, <f>, q) where ip = y/af3 and <ft = y/a//3 6 (0, 00). 

EP-GIG can be regarded as a variant of generalized hyperbolic distributions (J0rgensen, 
1982), because when q = 2 EP-GIG is generalized hyperbolic distributions — a class of 
Gaussian scale mixtures. However, EP-GIG becomes a class of Laplace scale mixtures 
when q = 1. 

In Appendix B we present several new concrete EP-GIG distributions, obtained from 
particular settings of 7 and q. We now consider the two special cases that mixing density 
is either a gamma distribution or an inverse gamma distribution. Accordingly, we have 
two special EP-GIG: exponential power-gamma distributions and exponential power-inverse 
gamma distributions. 

3.1 Generalized t Distributions 

We first consider an important family of EP-GIG distributions, which are scale mixtures 
of exponential power EP(b\u,rj,q) with inverse gamma IG(ry|r/2, r/(2A)). Following the 
terminology of Lee et al. (2010), we refer them as generalized t distributions and denote 
them by GT(b\u, r/A, r/2, q). Specifically, the density of the generalized t is 



P (b) = J ep(%, r,, q)\G( v \T/2, r/(2\))dn = I (^) « (1 + ±\b - u\«) (6) 
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where r > 0, A > and q > 0. Clearly, when q = 2 the generalized t distribution becomes 
to a i-distribution. Moreover, when r = 1, it is the Cauchy distribution. 

On the other hand, when q = 1, Cevher (2009) and Armagan et al. (2011) called the 
resulting generalized t generalized double Pareto distributions (GDP). The density of GDP 
is specified as 



p(b) 



J L(6|0^)IG(^|r/2,r/(2A))^ = -(l + -U) , A>0,r>0. (7) 



Furthermore, we let r = 1; namely, rj ~ IG(t/|1/2, 1/(2A)). As a result, we obtain 

p(6) = ^(l + A|6|)- 3 / 2 , 

It is well known that the limit of the t-distribution at r — > oo is the normal distribution. 
We find that we are able to extend this property to the generalized t distribution. In 
particular, we have the following theorem, which is in fact a corollary of the first part of 
Theorem 5. 

Theorem 6 Let the generalized t distribution be defined in (6). Then, for A > and q > 0, 

lim GT(6|«,r/A,r/2,g) = EP(%, 1/A, q). 

Thus, as a special case of Theorem 6 in q = 1, we have 

lim GT(%,t/A,t/2,1) = L(b\u,l/X). 

T->00 

3.2 Exponential Power-Gamma Distributions 

In particular, the density of the exponential power-gamma distribution is defined by 

97+1 q-y-1 

p(6| 7 ,a) = / EP(6|0,»/, g )G(»7|7,a/2)d77 = ^ if i( V / *) 1 7,a> 0. 

2 — r(2±i)r( 7 ) * 

We denote the distribution by EG(6|a, 7, q). As a result, the density of the normal-gamma 
distribution (Griffin and Brown, 2010a) is 

p(bh,a)= [°° N(b\0,r,)G( V \ 1 ,a/2)dr,= " \ ^ / K 7 _i(V5|&|), 7,«>0. (8) 
Jo 2 7 ~ 2^/^(7) 7 2 

As the application of the second part of Theorem 5 in this case, we can obtain the following 
theorem. 

Theorem 7 Let EG(fe|A7, 7/2, q) = J °° EP(6|0, 77, q)G{rj\^/2, \-f/2)dr} with A > 0. Then 

lim EG(6|A7,7/2,?) = EP(6|0,1/A, 9 ). 

7— >oo 
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It is easily seen that when we let 7 = 1, the normal-gamma distribution degenerates 
to the Laplace distribution L(b\0, a~ 1 / 2 /2). In addition, when q = 1 and 7 = 3/2 which 
implies that \b\rj\ ~ L(b\0,r]) and r\ ~ G(r]\3/2, a/2), we have 

p(6|a) = ^exp(-V^) = ^ +CX 'L(b|0,77)G(r/|3/2,a/2)d7/. (9) 

Obviously, the current exponential power-gamma is identical to exponential power distri- 
bution EP(6|0, a _1 / 2 /2, 1/2), a bridge distribution with q = 1/2. Interestingly, we can 
extend this relationship between the Gaussian and Laplace as we as between the Laplace 
and 1/2-bridge to the general case. That is, 

Theorem 8 Let 7 = \ + ± . Then, 

an 1 /' 1 r+oo 
EP(b\0,»~ 1/2 /2,qm = -jL^r exp(-^«W) = _/ EP(6|0, ?7 ,9)G(r ? |7,a/2)dr ? . 



This theorem implies that a g/2— bridge distribution can be represented as a scale mixture 



of q— bridge distributions. A class of important settings are q = 2 1 m and 7 = \ + - — ^i 2 ^ 



where m is any nonnegative integer. 



3.3 Conditional Priors, Marginal Priors and Posteriors 

We now study the posterior distribution of rj conditioning on b. It is immediate that 
the posterior distribution follows GIG(r/|(7g— l)/g, (/3 + \b\ q ),a). This implies that GIG 
distributions are conjugate with respect to the EP distribution. We note that in the cases 
7 = 1/2 and q = 1 as well as 7 = and q = 2, the posterior distribution is GIG(ry| — 1/2, (/3 + 
|6| 9 ),a). In the cases 7 = 3/2 and q = 1 as well as 7 = 1 and q = 2, the posterior 
distribution is GIG(r?|l/2, (/3 + \b\ q ),a). When 7 = -1/2 and q = 1 or 7 = -1 and q = 2, 
the posterior distribution is GIG(ry|— 3/2, (/? + \b\ q ),a). 
Additionally, we have the following theorem. 

Theorem 9 Suppose that b\rj ~ EP(6|0, 77, q) and r] ~ GIG(r/|7, /3, a). Then 

(i) b~ EGIG(6|a,/3,7,g) and r/|6 ~ G\G{ri\{jq-l)/q, (0 + |6|«), a). 

(ii) 9 -^^ = hE(v- 1 \b) = Uv" 1 p(v\b)d V . 

When as a penalty — log p(b) is applied to supervised sparse learning, iterative re- weighted 
t\ or £2 local methods are suggested for solving the resulting optimization problem. We 
will see that Theorem 9 shows the equivalent relationship between an iterative re-weighted 
method and an EM algorithm, which is presented in Section 4. 



3.4 Duality between Priors and Penalties 

Since there is duality between a prior and a penalty, we are able to construct a penalty from 
p(b); in particular, — log p(b) corresponds to a penalty. For example, let p(b) be defined as 
in (13) or (14) (see Appendix B). It is then easily checked that — logp(6) is concave in \b\. 
Moreover, if p(b) is given in (9), then — log p(b) induces the penalty |6| 1//2 . In fact, we 
have the following theorem. 
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Theorem 10 Let p(b) be the EP-GIG density given in (5). If — logp(fr) is regarded as 
a function of \b q \, then ~ ~fj^r^ is completely monotone on (0, oo). Furthermore, when 
< q < 1, — log(p(6)) is concave in \b\ on (0, oo); namely, — log(p(b)) defines a class of 
nonconvex penalties for b. 

Here a function 4>(z) on (0, oo) is said to be completely monotone (Feller, 1971) if it 
possesses derivatives (f>^ of all orders and 

(-l) n ^ n \z) > 0, z > 0. 

Theorem 10 implies that the first-order and second-order derivatives of — log(p(6)) with 
respect to \b\ q are nonnegative and nonpositive, respectively. Thus, — log(p(6)) is concave 
and nondecreasing in \b\ q on (0, oo). Additionally, \b\ q for < q < 1 is concave in \b\ on 
(0,oo). Consequently, when < q < 1, — log(p(6)) is concave in \b\ on (0,oo). In other 
words, — log(p(6)) with < q < 1 induces a nonconvex penalty for b. 

Figure 1 graphically depicts several penalties, which are obtained from the special priors 
in Appendix B. It is readily seen that the fist three penalty functions are concave in |6| 
on (0, oo). In Figure 2, we also illustrate the penalties induced from the 1/2-bridge scale 
mixture priors (see Examples 7 and 8 in in Appendix B), generalized t priors and EP-G 
priors. Again, we see that the two penalties induced from the 1/2-bridge mixture priors are 
concave in |6| on (0,oo). This agrees with Theorem 10. 

4. Bayesian Sparse Learning Methods 

In this section we apply EP-GIG priors to empirical Bayesian sparse learning. Suppose we 
are given a set of training data {(xj,yj) : i = 1, . . . ,n}, where the Xj G MP are the input 
vectors and the yi are the corresponding outputs. Moreover, we assume that Y17=i x « = 
and YH=i Hi = 0- We now consider the following linear regression model: 

y = Xb + e, 

where y = {yi, ■ ■ ■ ,y n ) T is the nxl output vector, X = [xi, . . . , x n ] T is the nxp input 
matrix, and £ is a Gaussian error vector N(e\0, <rl n ). We aim to estimate the vector of 
regression coefficients b = (pi, . . . , b p ) T under the MAP framework. 

4.1 Bayesian Sparse Regression 

We place an EP-GIG prior on each of the elements of b. That is, 

v 

p(b|a) = ]jEGIG(6 i |a- 1 a, CT /3,7,g). 

j=i 

Using the property the the EP-GIG distribution is a scale mixture of exponential power 
distributions, we devise an EM algorithm for the MAP estimate of b. For this purpose, we 
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Figure 1: Penalty functions induced from exponential power-generalized inverse gamma 
(EP-GIG) priors in which a = 1. 



define a hierarchical model: 

[y|b,a] ~JV(y|Xb, < 7l n ), 
[ b j\Vj>°] ~ E P(&j|O,0-Jfe,g), 

febs A«] ~ GIG(?7 j |7,/3,a), 
p(a) = "Constant" . 

According to Section 3.3, we have 

[rijK<T,a,p,i\ ~GIG(» ?i |(79-l)/9, a). 
Given the ith estimates (b^, crW) of (b, cr), the E-step of EM calculates 

Q(b,<7|bW = logp(y|b,a) + ^ / log p[b jfa, aW^lbf, a® , a, 

oc -^loga-^(y-Xb) T (y-Xb) - ^loga 
I la q 



Y a E ^ I" / ^P^f , ° {t) , «, A 7) *7i 
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(a) EP-GIG( 7 =|, g = i a = 1) 



(b) EP-GIG( 7 =|, g =|,a = l) 





















\ / 


T = 1 

---t = 2 

t = 10 





(c) GT (or GDP) (q = 1 and A = 1) 




(d) GT (q = 2 and A = 1) 




(e) EG (q = 1 and A = 1) 









y= 1/2 




---y=3/2 




Y=3 






(f) EG (g = 2 and A = 


= 1) 



Figure 2: Penalty functions induced from 1/2-bridge scale mixture priors, exponential 
power-inverse gamma (or generalized t, GT) priors and exponential power-gamma 
(EG) priors. 



Here we omit some terms that are independent of parameters a and b. In fact, we only need 
calculating E[r]J^\b^\ a^) in the E-step. It follows from Proposition 17 (see Appendix A) 
that 



(t+l) A jpf (t)\ 

w \ = E{rj- \by,(T { ') 



a 



1/2 



K, 



(■yq-q- 



-i), q (y/<*mb?\ q /<r®]) 



[P+\bf\«/<T 



it) 



1/2 



-D/ff( \l<x[PMbf\*/aV\) 



(10) 
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Table 1: E-steps of EM for different settings of 7 and q. Here we omit superscripts "(t)". 



(7,9) 


7= 2'9 = 1 


7= 5,«= 1 


7 = 0,9 = 2 


7= 1,(7 = 2 


Wj = 


l+^/aCS+CT-MM) 




l+^/a^+a-ifel) 




/3+<r-i|M 


\J P+<y~ 1 \b,\ 







There do not exist analytic computational formulae for arbitrary modified Bessel functions 
K u . In this case, we can resort to a numerical approximation to the Bessel function. 
Fortunately, once 7 and q take the special values in Appendix B, we have closed-form 
expressions for the corresponding Bessel functions and thus for the Wj. In particular, we 
have from Proposition 18 (see Appendix A) that 



w 



(t+i) 



T (t) t 



1/2 



.(t) + [ .(t) a ( .(t) /3+ | 6 (*)| 9 )]l/2 



+ 



(79-1)/? = 1/2, 

(79-l)/9 = -1/2, 
(7?-l)/9 = -3/2. 



In Table 1 we list these cases with different settings of 7 and q. 

The M-step maximizes Q(b,a\h^\a^) with respect to (b,cr). In particular, it is ob- 
tained as follows: 



b(* +1 ) = argmin (y-Xb) T (y-Xb) + J^* +1) |& J f , 



r (*+l) 



<7n+2p 



{(y-Xb( t + 1 )) r (y-Xb(*+ 1 ))+^^ +1) |6j m V}. 



4.2 The Hierarchy for Grouped Variable Selection 



p}; that is, Uj =1 Ij = I and 



In the hierarchy specified previously each bj is assumed to have distinct scale rjj. We can 
also let several bj share a common scale parameter. Thus we can obtain a Bayesian approach 
to group sparsity (Yuan and Lin, 2007). We next briefly describe this approach. 

Let Ii for I = l,...,g be a partition of / = {1,2 
Ij n Ii = for j 7^ I. Let be the cardinality of Ii, and hi = {bj : j € /;} denote the 
subvectors of b, for I = 1, . . . ,g. The hierarchy is then specified as 

[y|b,(7]~JV(y|Xb, < 7l n ), 

iid 



[bj\r]i,a] ~ EP(bj\0,ai]i,q),j £ I x 



,9- 



Moreover, given a, the b; are conditionally independent. By integrating out r)i, the marginal 
density of b; conditional on a is then 



P(b;|cr) 



K liq - n (\/a{(3+a r p^)) nPl /{2 q ) 



■'llbill? 



(7(9-Pl)/(2<?) 
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which implies b; is non- factorial. The posterior distribution of rji on b; is then G I G (77; | 

cr -1 !!^!!^, a). 

In this case, the iterative procedure for (b, a) is given by 



■yii-pi 



b (t+i) = argmin ( y -Xb) T (y-Xb) + ^wf +1) ||b, 



r (*+l) 



1=1 



qn+2p 
where for I = 1, . . . , g, 



{(y-Xb( t + 1 )f(y-Xb(*+ 1 ))+^^ +1 )||bf +1 )||^}, 



i=i 



w 



a 



1/2 



[P+\\bU\\ya(t)f 2 K m (^«[/3 + ||bf)||^W]) ' 



Recall that there is usually no analytic computation for u;, . However, setting such 7; 
that = I or 2" = -i 

5 2 g 2 

/ ■ 

(m) 



2 can yield an analytic computation. As a result, we have 
1/2 



7 (*) f 



^ t) /3+l|b ) ( "||? 



(nq-pi)/q = 1/2, 



.w^H^Hg (T»ff-Pl)/9 = -V2. 

Figure 3 depicts the hierarchical models in Section 4.1 and 4.2. It is clear that when 
g = p and p\ = ■ ■ ■ = p g = 1, the models are identical. 





(a) independent (b) grouped 

Figure 3: Graphical representations. 
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4.3 Extensions to Logistic Regression 

Another extension is the application in the penalized logistic regression model for classi- 
fication. We consider a binary classification problem in which y £ {0, 1} now represents 
the label of the corresponding input vector x. In the logistic regression model the expected 
value of m is given by 

P(Vi = !| x i) = 7- - f tTT - ^i- 

1 + exp(-x| b) 

In this case a = 1 and the log-likelihood function on the training data becomes 

n 

logp(y|b) = ^[2/ilogTT; + (l-y;)log(l-7r;)]. 
i=l 

Given the tth estimate b^ of b, the E-step of EM calculates 

Q(b|b«)^logp(y|b) + ^ ! \ogp[b \ n3 )p( m \bf jaj p n )d m 

oc ^[yilogTTi + (l-y,)log(l-7r,)] - \ ^wf^b^ . 

i=l j=l 

As for the M-step, a feasible approach is to first obtain a quadratic approximation to 
the log-likelihood function based on its second-order Taylor series expansion at the cur- 
rent estimate b^ of the regression vector b. We accordingly formulate a penalized linear 
regression model. In particular, the M-step solves the following optimization problem 

v 

b(* +1 ) = argmin (y-Xb) T W(y-Xb) + V wf +1) \bj\ q , 

where y, the working response, is defined by y = Xb^ + W~ x (y — tt), W is a diagonal 
matrix with diagonal elements 7Tj(l — 7Tj), and 7r = (tti, . . . , n n ) . Note that here the n are 
evaluated at b^. 



5. Iteratively Re- weighted £ q Methods 

We employ a penalty induced from the EP-GIG prior EGIG(6|ao, fio, 7, q)- Then the penal- 
ized regression problem is 

mm -(y-Xb) T (y-Xb) - { logi^^ao^o+l^)) - log(/3 +|6,f )}, 

i=i 

which can be solved via the iteratively re-weighted £ q method. Given the tth estimate b^ 
of b, the method considers the first Taylor approximation of — log EGIG(6j \ao, (3o,7, o) w.r.t. 
\bj\ q at \b®\ q and solves the following problem 

mm l(y-Xbf (y-Xb) + A^f^f , 

3=1 
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where uf +1) = gzlSifGiGfe^^osg) ■ Jt followg from Theorem 9 _ (ii) that 

5.1 Relationship between EM and Iteratively Re-weighted Methods 

We investigate the relationship of the EM algorithm with an iteratively re-weighted £ q 
method where q = 1 or q = 2. When equating ao = a/ a, (3q = /3a and A = a, we 
immediately see that the 2ujj are equal to the Wj in (10). This implies the iteratively 
re- weighted minimization method is identical to the EM algorithm given in Section 4.1. 

When q = 2, the EM algorithm is identical to the re- weighted £2 method and corresponds 
to a local quadratic approximation (Fan and Li, 2001, Hunter and Li, 2005). And when 
q = 1, the EM algorithm is the re- weighted £\ minimization and corresponds to a local 
linear approximation (Zou and Li, 2008). Especially, when we set 7 = 1 and q = 2, the 
EM algorithm is the same as one studied by Daubechies et al. (2010). This implies that 
the re-weighted £2 method of Daubechies et al. (2010) can be equivalently viewed as an EM 
based on our proposed EP-GIG in Example 5 of Appendix B. When the EM algorithm is 
based on our proposed EP-GIG prior in Example 4 of Appendix B (i.e. 7 = 1 and q = 2), 
it is then the combination of the re-weighted £2 method of Daubechies et al. (2010) and the 
re- weighted £2 method of Chartrand and Yin (2008). 

When 7 = | and q = 1, the EM algorithm (see Table 1) is equivalent to a re- weighted £\ 
method, which in turn has a close connection with the re- weighted £2 method of Daubechies 
et al. (2010). Additionally, the EM algorithm based on 7 = \ and q = 1 (see Table 1) can 
be regarded as the combination of the above re-weighted £\ method and the re-weighted £\ 
of Candes et al. (2008). Interestingly, the EM algorithm based on the EP-GIG priors given 
in Examples 7 and 8 of Appendix B (i.e., 7 = | and q = ^ or 7 = | and q = i) corresponds 
a re- weighted £ X j2 method. 

In is also worth mentioning that in Appendix C we present EP-Jeffreys priors. Using 
this prior, we can establish the close relationship of the adaptive lasso of Zou (2006) with 
an EM algorithm. In particular, when q = 1, the EM algorithm based on the Jeffreys prior 
is equivalent to the adaptive lasso. 

5.2 Oracle Properties 

We now study the oracle property of our sparse estimator based on Laplace scale mixture 
priors. For this purpose, following the setup of Zou and Li (2008), we assume two conditions: 
(1) yi = xfb* + ei where ei,...,e n are i.i.d errors with mean and variance a 2 ; (2) 
X T X/n — > C where C is a positive definite matrix. Let A = {j : b* / 0}. Without loss of 
generality, we assume that A = {1,2, . . . ,po} with po < p. Thus, partition C as 

Cn C12 
C21 C22J ' 

where Cu is po x Po- Additionally, let b^ = {b* : j £ A} and h\ = {u n j : j £ A}. 
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We in particular consider the following one-step sparse estimator: 



b« = argmin (y-Xbf (y-Xb) + A n ± M ^f^^ , (12) 

where Q„(*) = /^(v^Av^Cv^)) and b<°) = {bf \ . . . , b { p 0) ) T is a root-n-consistent 
estimator to b* . The following theorem shows that this estimator has the oracle property. 
That is, 

Theorem 11 Let bJJ = {&$ : j e i} and „4 n = {j : bJJ + 0}. Suppose that A 

— ► 0, a n /n — >■ ci and a n /3„ — > c<i, or that Xn/n 1 ^ — > oo, X n /^/n — > 0, a n / yjn — >■ ci 
and a„/?„ — >■ C2- i/ere ci, C2 £ (0, oo). T/ien b^ satisfies the following properties: 

(1) Consistency in variable selection: limn^oo P(A n = A) = 1. 

(2) Asymptotic normality: y/n(b$ - b*) ->d iV(0, <r 2 C 11 1 ). 



6. Experimental Studies 

In this paper our principal purpose has been to provide a new hierarchical framework 
within which we can construct sparsity- inducing priors and EM algorithms. In this section 
we conduct an experimental investigation of particular instances of these EM algorithms. In 
particular, we study the cases in Table 1. We also performed two EM algorithms based on 
the generalized t priors, i.e. the exponential power-inverse gamma priors (see Section 3.1). 
For simplicity of presentation, we denote them by "Method 1," "Method 2," "Method 3," 
"Method 4," "Method 5," "Method 6," and "Method 7," respectively. Table 2 lists their 
EP-GIG prior setups (the notation is the same as in Section 3). As we see, using the 
EP-GIG priors given in Examples 7 and 8 (see Appendix B) yields EM algorithms with 
closed-form E-steps. However, the corresponding M-steps are a weighted £ 1 / 2 minimization 
problem, which is not efficiently solved. Thus, we did not implement such EM algorithms. 

For "Method 1," "Method 2," "Method 3," "Method 5" and "Method 6," we fix a = 1 
and er(°) = 1, and use the cross validation method to select f3. In "Method 4" and "Method 

7, " the parameter A was selected by using cross validation. In addition, we implemented the 
lasso, the adaptive lasso (adLasso) and the SCAD-based method for comparison. For the 
lasso, the adLasso and the re-weighted l\ problems in the M-step, we solved the optimization 
problems by a coordinate descent algorithm (Mazumder et al., 2011). 

Recall that "Method 1," "Method 2," "Method 3," "Method 4" and AdLasso in fact 
work with the nonconvex penalties. Especially, "Method 1," "Method 2" and "Method 3" 
are based on the Laplace scale mixture priors proposed in Appendix B by us. "Method 4" 
is based on the GDP prior by Armagan et al. (2011) and Lee et al. (2010), and we employed 
the £1/2 penalty in the adLasso. Thus, this adLasso is equivalent to the EM algorithm 
which given in Appendix D. Additionally, "Method 5" and "Method 6" are based on the 
Gaussian scale mixture priors given in Appendix B by us, and "Method 7" is based on the 
Cauchy prior. In Appendix C we present the EM algorithm based on the EP-Jeffreys prior. 
This algorithm can be also regarded as an adaptive lasso with weig hts l/\bf\. Since the 
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Table 2: The EP-GIG setups of the algorithms. 



Method 1 


Method 2 


Method 3 


Method 4 


EGIG(&|a-V/3,i,l) 
(9 = 1,7=5) 


EGIG(&|<7-\«7/3, |,1) 
(9 = 1,7=|) 


EGIG(&|<7 -1 , <j/3, — ±, 1) 
( 9 = 1, 7 = -I) 


GT(6|0,f,i,l) 
(9 = 1, r = 1) 


Method 5 


Method 6 


Method 7 


AdLasso 


EGIG(6|cr _1 ,cr/3, 0, 2) 
(q = 2, 7 = 0) 


EGIG(6|o— 1 ,o-/9, 1,2) 
(9 = 2,7=1) 


GT(6|0,f,i,2) 
(9 = 2, r = 1) 


ocexp(-|6| 1 /2) 
(9 = 3) 



performance of the algorithms is same to that of "Method 4" , here we did not include the 
results with the the EP-Jeffreys prior. We also did not report the results with the Gaussian 
scale mixture given in Example 6 of Appendix B, because they are almost identical to those 
with 'Method 5" or "Method 6". 

6.1 Reconstruction on Simulation data 

We first evaluate the performance of each method on the simulated data which were used 
in Fan and Li (2001), Zou (2006). Let b = (3, 1.5, 0, 0, 2, 0, 0, 0) T , X; ~ JV(0,S) with 
Sjj = 0.5 1* I , and yo = Xb. Then Gaussian noise e ~ iV(0,5 2 I n ) is added to yo to form 
the output vector y = y + e. Let b denote the sparse solution obtained from each method 
which takes X and y as inputs and outputs. Mean square error (MSE) ||yo — Xb^/n is used 
to measure reconstruction accuracy, and the number of zeros in b is employed to evaluate 
variable selection accuracy. If a method is accurate, the number of "correct" zeros should 
be 5 and "incorrect" (IC) should be 0. 

For each pair (n, S), we generate 10,000 datasets. In Table 3 we report the numbers of 
correct and incorrect zeros as well as the average and standard deviation of MSE on the 
10,000 datasets. From Table 3 we see that the nonconvex penalization methods (Methods 1, 
2, 3 and 4) yield the best results in terms of reconstruction accuracy and sparsity recovery. 
It should be pointed out that since the weights are defined as the l/\b^ j 1 / 2 in the adLasso 
method, the method suffers from numerical instability. In addition, Methods 5, 6 and 7 
are based on the re-weighted £2 minimization, so they do not naturally produce sparse 
estimates. To achieve sparseness, they have to delete small coefficients. 

6.2 Regression on Real Data 

We apply the methods to linear regression problems and evaluate their performance on three 
data sets: Pyrim and Triazines (both obtained from UCI Machine Learning Repository) and 
the biscuit dataset (the near-infrared (NIR) spectroscopy of biscuit doughs) (Breiman and 
Friedman, 1997). For Pyrim and Triazines datasets, we randomly held out 70% of the data 
for training and the rest for test. We repeat this process 10 times, and report the mean and 
standard deviation of the relative errors defined as 

2/(xj)-j/(xj) 
y(xi) 



-1 'nest 

-E 



ntest 



A 1 
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Table 3: Results on the simulated data sets. 



MSE(±STD) C IC 



MSE (±STD) C IC 



MSE (±STD) C IC 



60, 8 = 3 



Method 
Method 
Method 
Method 



0.699(±0.63 

0.700(±0.63 
0.728(±0.60 
0.713(±0.68 



4.66 
4.55 
4.57 
4.78 



0.08 
0.07 
0.08 
0.12 



n = 120. 
0.279(±0.26 

0.287(±0.30 
0.284(±0.28 
0.281(±0.26 



<5 = 3 
4.87 
4.83 
4.93 

4.89 



0.01 
0.02 
0.00 
0.01 



n = 120, 8 = 1 
0.0253(± 0.02) 5.00 0.00 
0.0256(±0.03) 4.99 0.00 
0.0253(±0.02) 5.00 0.00 
0.0255(±0.03) 5.00 0.00 



Method ! 
Method ( 
Method ' 

SCAD 
AdLasso 
Lasso 
Ridge 



1.039(±0.56 
0.745(±0.66 
0.791(±0.57 

0.804(±0.59 
0.784(±0.57 
0.816(±0.53 
1.012(±0.50 



0.30 
1.36 
0.20 



0.00 
0.00 
0.00 



3.24 0.02 
3.60 0.04 
2.48 0.00 
0.00 0.00 



0.539(±0.28 
0.320(±0.26 
0.321(±0.28 

0.364(±0.30 
0.335(±0.27 
0.406(±0.26 
0.549(±0.27 



0.26 
1.11 
0.42 



0.00 
0.00 
0.00 



3.94 0.00 
4.83 0.01 
2.40 0.00 
0.00 0.00 



0.0599(±0.03) 0.77 0.00 

0.0262(±0.02) 4.96 0.00 

0.0265(±0.02) 2.43 0.00 

0.0264(±0.03) 4.95 0.00 

0.0283(±0.02) 4.82 0.00 

0.0450(±0.03) 2.87 0.00 

0.0658(±0.03) 0.00 0.00 



where y(xj) is the target output of the test input Xj, and y(xj) is the prediction value 
computed from a regression method. For the NIR dataset, we use the supplied training 
and test sets: 39 instances for training and the rest 31 for test (Breiman and Friedman, 
1997). Since each response of the NIR data includes 4 attributes ("fat," "sucrose," "flour" 
and "water"), we treat the data as four regression datasets; namely, the input instances and 
each-attribute responses constitute one dataset. 

The results are listed in Table 4. We see that the four new methods outperform the 
adaptive lasso and lasso in most cases. In particular Methods 1, 2, 3 and 4 (the nonconvex 
penalization) yield the best performance over the first two datasets, and Methods 5, 6 and 
7 are the best on the NIR datasets. 



Table 4: Relative error of each method on the three data sets. The numbers of instances 
(n) and numbers of features (p) of each data set are: n = 74 and p = 27 in Pyrim, 
n = 186 and p = 60 in Triazines, and n = 70 and p = 700 in NIR. 





Pyrim 


Triazines 


NIR(fat) 


NIR(sucrose) 


NIR(flour) 


NIR(water) 


Method 1 


0.1342(±0.065) 


0.2786(±0.083) 


0.0530 


0.0711 


0.0448 


0.0305 


Method 2 


0.1363(±0.066) 


0.2704(±0.075) 


0.0556 


0.0697 


0.0431 


0.0312 


Method 3 


0.1423(±0.072) 


0.2792(±0.081) 


0.0537 


0.0803 


0.0440 


0.0319 


Method 4 


0.1414(±0.065) 


0.2772(±0.081) 


0.0530 


0.0799 


0.0448 


0.0315 


Method 5 


0.1381(±0.065) 


0.2917(±0.089) 


0.0290 


0.0326 


0.0341 


0.0210 


Method 6 


0.2352(±0.261) 


0.3364(±0.079) 


0.0299 


0.0325 


0.0341 


0.0208 


Method 7 


0.1410(±0.065) 


0.3109(±0.110) 


0.0271 


0.0423 


0.0277 


0.0279 


SCAD 


0.1419(±0.064) 


0.2807(±0.079) 


0.0556 


0.0715 


0.0467 


0.0352 


AdLasso 


0.1430(±0.064) 


0.2883(±0.080) 


0.0533 


0.0803 


0.0486 


0.0319 


Lasso 


0.1424(±0.064) 


0.2804(±0.079) 


0.0608 


0.0799 


0.0527 


0.0340 
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Table 5: Results on the simulated data sets. 





MSE(±STD) C IC 


MSE (±STD) C IC 


MSE (±STD) C IC 


Method 1' 
Method 2' 
Method 3' 
Method 4' 


n = 60, <5 = 3 
2.531(±1.01) 15.85 0.31 
2.516(±1.06) 15.87 0.28 
2.445(±0.96) 15.88 0.54 
2.674(±1.12) 15.40 0.30 


n = 120, 5 = 3 
1.201(±0.45) 16.00 0.14 
1.200(±0.43) 15.97 0.10 
1.202(±0.43) 15.98 0.25 
1.220(±0.45) 15.79 0.49 


n = 120, 8 = 1 
0.1335(±0.048) 15.72 0.01 
0.1333(±0.047) 15.87 0.00 
0.1301(±0.047) 16.00 0.01 
0.1308(±0.047) 16.00 0.00 


Method 5' 
Method 6' 
Method 7' 

GLASSO 

AdLasso 
Lasso 


2.314(±0.90) 5.77 0.04 
2.375(±0.92) 10.18 0.04 
2.478(±0.97) 9.28 0.05 

2.755(±0.92) 5.52 0.00 
3.589(±1.10) 11.36 2.66 
3.234(±0.99) 9.17 1.29 


1.163(±0.41) 7.16 0.03 
1.152(±0.41) 15.56 0.03 
1.166(±0.41) 14.17 0.03 

1.478(±0.48) 3.45 0.00 
1.757(±0.56) 11.85 1.42 
1.702(±0.52) 8.53 0.61 


0.1324(±0.047) 16.00 0.01 
0.1322(±0.047) 16.00 0.00 
0.1325(±0.047) 15.96 0.00 

0.1815(±0.058) 3.05 0.00 
0.1712(±0.058) 14.09 0.32 
0.1969(±0.060) 8.03 0.05 



6.3 Experiments on Group Variable Selection 

Here we use p = 32 with 8 groups, each of size 4. Let /3 1:4 = (3, 1.5, 2, 0.5) T , /3 g . 12 = 
/3 17:2 o = (6,3,4, 1) T , /3 2 5:28 = (1-5,0.75, 1,0.25) T with all other entries set to zero, while X, 
yo, and y are defined in the same way as in Section 6.1. If a method is accurate, the number 
of "correct" zeros should be 16 and "incorrect" (IC) should be 0. Results are reported in 
Table 5. 

6.4 Experiments on Classification 

In this subsection we apply our hierarchical penalized logistic regression models in Sec- 
tion 4.3 to binary classification problems over five real-world data sets: Ionosphere, Spam- 
base, Sonar, Australian, and Heart from UCI Machine Learning Repository and Statlog. 
Table 6 gives a brief description of these five datasets. 



Table 6: The description of datasets. Here n: the numbers of instances; p: the numbers of 
features. 





Ionosphere 


Spambase 


Sonar 


Australian 


Heart 


n 


351 


4601 


208 


690 


270 


P 


33 


57 


60 


14 


13 



In the experiments, the input matrix X G W nxp is normalized such that Y^a=i x ij = 
and Y17=i x ij = i for all j = 1, • • • ,p. For each data set, we randomly choose 70% for 
training and the rest for test. We repeat this process 10 times and report the mean and 
the standard deviation of classification error rate. The results in Table 7 are encouraging, 
because in most cases Methods 1, 2, 3 and 4 based on the nonconvex penalties perform over 
the other methods in both accuracy and sparsity. 
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Table 7: Misclassification rate (%) of each method on the five data sets. 





Ionosphere 


Spambase 


Sonar 


Australian 


Heart 


Method 1 
Method 2 
Method 3 
Method 4 


9.91(±2.19) 

10.19(±2.03) 
10.00(±1.95) 
10 66(±1 941 


7.54(±0.84) 
7.47(±0.85) 

7.58(±0.83) 
7 61f±0 83") 


18.71(±5.05) 

19.19(±5.18) 
19.03(±4.35) 
21 65(±5 11) 

J—i _L_ * V / .../ 1 1 J ,11} 


12.46(±2.08) 
12.56(±2.06) 
12.61(±2.15) 
12 65f±2 141 


13.83(±3.33) 
14.20(±3.50) 
14.32(±3.60) 
13 95f±3 49) 


Method 5 
Method 6 
Method 7 

SCAD 

4/2 

4 


11.51(±3.77) 
11.51(±3.72) 
11.70(±4.06) 

10.47(±2.06) 
10.09(±1.67) 
10.47(±1.96) 


8.78(±0.41) 
8.86(±0.41) 
9.49(±0.33) 

7.58(±0.83 
7.51(±0.86) 
7.57(±0.83) 


21.61(±5.70) 
21.94(±5.85) 
22.58(±5.84) 

21.94(±5.60) 
20.00(±5.95) 
21.61(±5.11) 


12.03(±1.74) 

13.24(±2.22) 
14.11(±2.48) 

12.66(±2.08) 
12.56(±2.15) 
12.66(±2.15) 


13.21(±3.14) 

14.57(±3.38) 
13.46(±3.10) 

13.83(±3.43) 
14.20(±3.78) 
13.95(±3.49) 



7. Conclusions 

In this paper we have proposed a family of sparsity-inducing priors that we call exponential 
power-generalized inverse Gaussian (EP-GIG) distributions. We have defined EP-GIG as 
a mixture of exponential power distributions with a generalized inverse Gaussian (GIG) 
density. EP-GIG are extensions of Gaussian scale mixtures and Laplace scale mixtures. As 
a special example of the EP-GIG framework, the mixture of Laplace with GIG can induce 
a family of nonconvex penalties. In Appendix B, we have especially presented five now 
EP-GIG priors which can induce nonconvex penalties. 

Since GIG distributions are conjugate with respect to the exponential power distribu- 
tion, EP-GIG are natural for Bayesian sparse learning. In particular, we have developed 
hierarchical Bayesian models and devised EM algorithms for finding sparse solutions. We 
have also shown how this framework can be applied to grouped variable selection and logis- 
tic regression problems. Our experiments have validate that our proposed EP-GIG priors 
forming nonconvex penalties are potentially feasible and effective in sparsity modeling. 

Appendix A. Proofs 

In order to obtain proofs, we first present some mathematical preliminaries that will be 
needed. 

A.l Mathematical Preliminaries 

The first three of the following lemmas are well known, so we omit their proofs. 
Lemma 12 Let linv^oo a(v) = a. Then lim^oo ^1 + —^j = exp(a). 

Lemma 13 (Stirling Formula) lim^oo {2it)1/2 J^} 2 exp( _ u) = 1- 
Lemma 14 Assume z > and v > 0. Then 

i im K »ly y2 *) = 1 

u^oo 7r 1 /22^-V2^(^-i)/2 z -^ exp(-i/) exp(-z 2 /4) 
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Proof Consider the integral representation of K v {y x l 2 z) as 

1 /"CO 

K V (V 1 ' 2 Z) = TT-VV^Vr^ + i) / (t 2 + I/* 2 )"""* COS (t)dt 

2 Jo 

= Tr-v^s^-^+D/^-^+Drr^ + 1) r ^ r cos(t)dt. 



Thus, we have 



K v (v 1/2 z) ,. Z* 00 cos(t) 
hm — — ? — — — — -. — - — 7 7- = hm / r cos(t)dt 

i/-^oo 7 r-i/22"i/-(''+i)/2 z -(''+i)r(i/+ 5) "-"Wo (1 + i 2 /(zAZ 2 )) H ^ 



/■oo 

/ cos(i)exp(-t 2 /z 2 )cft. 
JO 



We now calculate the integral J °° cos(i) exp(— t 2 /z 2 )dt for 2 > 0. We denote this integral 
by 4>{z) and let u = t/z. Hence, 

poo 

4>(z) = z / exp(— u 2 ) cos(uz)du = zf(z) 
Jo 

where f(z) = J °° exp(— u 2 ) cos(uz)du. Note that 

poo poo 

f'(z) = — / exp(— u 2 ) sm(uz)udu = - / sin(uz)<iexp(— u 2 ) 
Jo 2 J 

= ~J exp(-u 2 )cos(uz)du = -^f(z), 

which implies that f(z) = Cexp(— z 2 /4) where C is a constant independent of z. We 
calculate /(l) to obtain C. Since 

C= lim /(z) = lim / e~" cos (itz ) du = / e~ u du = —, 
1 7 *-►+<> Jo Jo 2 

we have </>(z) = -^z exp(— z 2 /4). Subsequently, 

zi 1 ^ vr -i/2 2 ^-(.+i)/2 z -(.+i)r^ + i) " 2 zem z 1 ] - 

On the other hand, it follows from Lemmas 12 and 13 that 
ton . r( r + 1/2) . = lim r( " +1 / 2 » 



^00 (2vr) 1 /2^exp(-i/) ^00 v /27ri> I/ [l+l/(2i/)] 1 ' exp(-i/) exp(-l/2) 
Thus, 

hm 



•^2 2^ 2j/ 2 exp(— z/) exp(— ^-) 
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Lemma 15 The modified Bessel function of the second kind K 7 (u) satisfies the following 
propositions: 

(1) K» = tf_ 7 (u); 

(2) K 1+l {u) = 2lK y (u) + tf 7 -i(u); 

(3) K 1/2 (u) = K_ 1/2 (u) = ^expi-u); 

( 4 ) = -5(^7-1 («) + VW) = -ViM " = W") " 

(5) For 7 G (—00, +00), K^(u) ~ exp(— -u) as u — > +00. 

Lemma 16 Let Q u (z) = K u -i(y/z) / \\fzK v {^fz)) where v G R and z > 0. T/ien, is 
completely monotone. 

Proof When ^ > 0, the case was well proved by Grosswald (1976). Thus, we only need to 
prove the case that v < 0. In this case, we let v = — r where r > 0. Thus, 

if- r -l(y/I) _ if T+ i(y/I) _ 2r K T _i(v/i) 



which is obvious completely monotone. ■ 

The following proposition of the GIG distribution can be found from J0rgensen (1982). 

Proposition 17 Let 77 be distributed according to G\G(r)\~/ , /3 , a) with a > and (3 > 0. 
Then 



We are especially interested in the cases that 7 = 1/2, 7 = —1/2, 7 = 3/2 and 7 = — 3/2. 
For these cases, we have the following results. 

Proposition 18 Let a > and /3 > 0. 

(1) If rj is distributed according to GIG (77 1 1/2, /3, a), £/ien 

a 

(2) is distributed according to G\G(rj\— 1/2, /?, a), i/ien 

7-1/ \ [P t?( -In l + V 7 ^ 

(3) 7/r/ is distributed according to GIG(t/|3/2, /3, a), i/ien 

. . 3 /3 „, 1, a 
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(4) If j) is distributed according to GIG(?7| — 3/2, /3, a), then 



E(r,- V ) 



+ 



a 



l + y/ap v ' ' (3 ' 1 + y/aP' 
Proof It follows from Lemma 15 that K 3/2 (u) = ^K 1/2 (u) = ^K_ 1/2 {u). 



We first consider the case that n ~ GIG(r/|l/2, /3, a). Consequently, E(r] *) = a//3 and 



^\i/2 if f(V^)_/ ) 3\i/21 + Va? 1 + y/afi 



Ki{yfaj3) \aJ ^fajS a 
As for the case that i] ~ GIG(?7| — 3/2, /?, a), it follows from Proposition 17 that 

E(n) = (- ] 



'py/2 K_ 1/2 (^p) _ _j_ 

.a) K_ 3/2 (y/^@) 



1 + 



and 



E( V - 



a 



+ 



a 



K- 3/2 (Vc7p) (3 l + v^3" 



Likewise, we have the second and third parts. 



A. 2 The Proof of Proposition 1 

Note that 

lim GMt.tX) = lim — — — r? T_1 exp(— rXri) 

= lim j j rf~ l exp(— rXn) (Use the Stirling Formula) 

T ^°° (2vr)2r r -2 exp(-r) 

_ lim _r]_ (At?) t 

r ^°° (2vr) 2-77 exp((Ar ? - l)r) ' 

Since ln-u < u — 1 for u > 0, with equality if and only if u = 1, we can obtain the proof. 
Likewise, we also have the second part. 

A. 3 The Proof of Proposition 2 

Using Lemma 14 

lim GIGfa| 7 ,/3,7<*)= I™ 7=^ ry^ 1 exp(-( 7 a>7 + ^~ 1 )/2) 

« 7 exp(^)exp(-/3r ? - 1 /2) 
= lim 1 — -1 1 V exp(-7(ar//2-l)) 

= li 1 ? 77 7 6X ^ (ar ? /2) 7 exp(- 7 (ar ? /2-l)) 
^+°° (2vr)2 exp(/3r ? - 1 /2) 

= S( V \2/a). 
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Again since lnu < u — 1 for u > 0, with equality if and only if u = 1, we can obtain the 
proof of Part (1). 

Let r = —7. We have 

lim GIG(r/|7, — 7/3, a) = lim GIG(??| — r, r/3, a) 

7^ — OO T— >+00 

= lim ^ffp^ rf^ exp(-(ar ? + r/3r ? - 1 )/2) 
due to K_ T {^Taf3) = K T {y/rafi). Accordingly, we also have the second part. 

A. 4 The Proof of Proposition 3 

Based on (4) and Lemma 15, we have that 

^l 2 1 

hm p(r))= hm — — — — = 5(rf\<j>). 

V>^+oo i/>^+oo ^2tt exp(2^(0?7 - l) 2 ) 

A. 5 The Proof of Theorem 4 



r+00 

p(b)= / EP(&|O,f7,(z)GIGfa|7,0,a)df7 



/ 



-00 



J exp(-r 7 - 1 |&|V2) o ( " / f ) ^ ? ? 7 " 1 exp(-(ar? + Z^ 1 )/^ 



2 2 fr(2±i) 7 2K 7 (^) 
(«//3) 7/2 



2^r(2±i)K 7 (^) 



/■ +0 ° 79-1 1 

/ r ] —- 1 cxp[-{ar ] + (f3+\b\ q )r ] ' 1 )/2}dr ] 
Jo 



K^y/Cttf+W)) 1/(2,) 

.5f [ / 3 + |5|9](7<?-l)/(2<?)_ 



A. 6 The Proof of Theorem 5 

The proof can be directly obtained from Proposition 2. That is, 

lim EGIG(6|7a,/?,7,g) = lim / EP(6|0, 7, q)Q\Q{ji\'y, 0, 70^77 

7^+00 7~5>+oo J 

EP(b\0, 7 ,q)5( V \2/a)d v = EP(6|0,2/a, g ). 



/' 



Likewise, we have the proof for the second part. As for the third part, it can be immediately 
obtained from Proposition 3. 

A. 7 The Proof of Theorem 6 

Note that 

lim GT(6|u,T/A,r/2, 9 )= lim g ( -) « (l + -\b - u\A ^ ^ . 

r^oo r^oo 2r(|)r(i) Vr/ V r 1 1 / 
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It follows from Lemmas 12 and 13 that 



and 



r^oo \ T / \ 2 / 



lim » r (W) ^i lim « /Ax{ (i+j)^-» / IN 

™2r(§)r(|)WJ ™2r(i/g)l T J ( i)=fi exp l «J 



2' 

r-1 



g /A A\i/9/ 2\V / 
= ^2T(W^ + ^) l 1 + ^J 6XP ( 
g /A\V? 
2r(l/g) V2 ) 



Thus, the proof completes. 



A. 8 The Proof of Theorem 7 

According to Proposition 1, we have 



/•OD 

lim EG(6|A 7 , 7 /2,g) = lim / EP(6|0, r,, g)G(r/| 7 /2, A 7 /2)dry 

>»oo 

= / EP(6|0,f7,g)[ lim G(r,\ 7 /2, \ 7 /2)]dr, 
Jo 

EP(b\0, V ,q)d(r l \l/X)d V 





EP(6|0,l/A,g). 



A. 9 The Proof of Theorem 8 

2 ^ g> 



With the setting that 7 = \ + ^ we have 



1 1 1 

00 



I EP(6|0, r?, 9 )Gfa| 7) a/2)d^ = <»' 4 H 4 ^(^i) 

2 ^r(2±i)r(i + i) 



g ^ V 2 1 q 



qa 



2 l+i 2 -l0F|r(f)(«l & l 9 ) 1/4 



exp(-Va|6|9) 



4T(^) 



exp(- v / ^) = EP(6|0,a- 1 / 2 /2,9/2). 



Here we use the fact that r(2±i)r(i+|) = 2 1 2{ W)^?r(l + |) = 2 l^|r(| y 
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A. 10 The Proof of Theorem 9 

The first part is immediate. We consider the proof of the second part. It follows from 
Lemma 15 that 



79 - 1 1 



2q p + \b\i 
_ i ^ ^^(v^W)) _ l 

due to that r/|6 ~ GIG(r/|(7g - !)/<?, y 7 /? + \b\i, a). 



A. 11 The Proof of Theorem 10 

For notational simplicity, we let z = \b q \, v = and <p(z) = 9 ~Q\f\q^ ■ According to the 
above proof, we have 

x _ a 1 K u ^i( y / a((3 + z)) 
[Z) ~ 2 + z) K v (y/a(J3 + z)) ' 

It then follows from Lemma 16 that 0(z) is completely monotone. 

A. 12 The Proof of Theorem 11 

Let b4 1} = b* + -£= and 

u = argmin (tf(u) := y - X(b* + * + A n f>f j, 

where 



(()) _ Van p n + a „ X 7 _ 2 (^/a„(/3 n + |6$ 0) |)) ^.^^(A, + 1)) 



^«n(/3„ + |6f I) K 7 -l(^a„(/?„ + |6f|)) ^ 7 -2(V«n(^n + l)) ' 

Consider that 

*(u) - *(0) = u r (Ix r X)u-2^Su+A n ^^ 0) {|^+^|-|6*|}. 

n Jn J V J Jn 1 J ) 

3=1 

We know that X T X/n — > C and ^£ — >a N(0, <r 2 C). We thus only consider the third term 
of the right-hand side of the above equation. Since a n (3 n — > c\ and a n — > oo (note that 
a n /n — > C2 > implies Q n — > +oo), we have 



JC 7 _i(Va„(/3 n + l)) 

K 7 _ 2 ( V / an(/?n + l)) 
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If b* = 0, then ^n(\b* + JL\ - \b*\) = \ Uj \. And since y/nbf = O p (l), we have a n \bf ] \ = 

(a n / 1 y/n)y/n\b^\ = O p (l). Hence, Q 7 _i(a n (/3 n + converges to a positive constant in 

probability. As a result, we obtain 

\ ,,(0) 



due to 



C2- 

(0) 



Vn ^ 7 -2(V«A + «n) 

If b* ± 0, thenwf -> p -J= > and^(|6*+^|-|6*|) u jS gn(b*). Thus v^+^l 

1 6* |) — 7-p 0. The remaining parts of the proof can be immediately obtained via some slight 
modifications to that in Zou (2006) or Zou and Li (2008). We here omit them. 

Appendix B. Several Special EP-GIG Distributions 

We now present eight other important concrete EP-GIG distributions, obtained from par- 
ticular settings of 7 and q. 

Example 1 We first discuss the case that q = 1 and 7 = 1/2. That is, we employ the 
mixing distribution of L(6|0,r/) with GIG(t/|1/2, (3, a). In this case, since 

Ki^y/atf+m = K_ 1/2 (y/a(l3+\b\)) = (^1^)1/4 exp(- V«(/?+l&D) 

and 

we obtain the following pdf for EGIG(6|a,/3, 1/2, 1): 

n l/2 

p(6) = ^exp( v / ^)(/3+|6|)- 1 /2 exp( _ v / a( ^ + | 6 | )) . (13) 



Example 2 The second special EP-GIG distribution is based on the setting of q = 1 and 
7 = 3/2. Since 

K 3/2 (u) = —K 1/2 {u) = — - I75 -exp(-«), 

we obtain that the pdf of GIG(??|3/2, fj, a) is 



p( V \a, P,3/2) = ^Jl^V^fK 1 * exp(-(«r ? + ^)j2) 
\/2tt \/ap + 1 



and that the pdf of EGIG(6|a, /3, 3/2, 1) is 



P(b) = a J X ^K x P (-y^WTW)). (14) 

4(V«P + 1) 



27 



Example 3 We now consider the case that q = 1 and 7 = —1/2. In this case, we have 
EG\G{b\a,/3, -1/2,1) which is a mixture of L(b\0,rj) with density GIG(r?|-l/2, /3, a). The 
density of EGIG(6|a,/3, -1/2, 1) is 

P(b) = ^^[jffl 1 + >/<*((* + \b\)) exp(-V«(/? + |6|)). 

Example 4 The fourth special EP-GIG distribution is EGIG(6|a, /?, 0, 2); that is, we let 
q = 2 and 7 = 0. In other words, we consider the mixture of the Gaussian distribution 
A^(6|0, 77) with the hyperbolic distribution GIG(r?|/3, a, 0). We now have 

1 



^■iwVPP*^ 1 ' (15) 

Example 5 In the fifth special case we set q = 2 and 7 = 1; that is, we consider 
the mixture of the Gaussian distribution N(b\0, rf) with the generalized inverse Gaussian 
G\G(r)\l,(3, a). The density of the corresponding EP-GIG distribution EGIG(6|a,/3, 1,2) is 



^ = 2K 1 (V^)^ ( - VWTW)) - <16) 

Example 6 The final special case is based on the settings q = 2 and 7 = — 1 . In this case, 
we have 

m = f>io,,)G, G (,,i-i, M *, = 1+ ,f^Mr*. 

Jo 2i^i( v / a/3)exp( v / a(/3 + 6 2 )) 

Example 7 We are also interested EP-GIG with q = 1/2, i.e. a class of bridge scale 
mixtures. In this and next examples, we present two special cases. First, we set q = 1/2 
and 7 = 3/2. That is, 



poo 

p(b)= / EP(6|0,»/,l/2)GIG(»/|3/2,/3,a)d»7 
J 



at exp( v / a/3) exp(- y'aOS+H 1 / 2 )) 



24(1 + ^5) (/3+H¥ 
Example 8 In this case we set g = 1/2 and 7 = 5/2. We now have 

p(b) = r EP(6|0, i/, 1/2)616(1/15/2, /?, a)di7 = a ^ P ^f } exp ( - y/atf+M 1 
Jo 2 4 (3 + 3v«P + otp) V v 

Appendix C. EP-Jeffreys Priors 

We first consider the definition of EP-Jeffreys prior, which the mixture of EP(t>|0, 77, q) with 
the Jeffreys prior I/77. It is easily verified that 

p(b)<x y EP(&|0,r / ,g)r ? - 1 dr ? = ||6|- 1 

and that [77 1 6] ~ \G(rj\\/q, \b\ q /2). In this case, we obtain 

E{^\b) = 
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On the other hand, the EP- Jeffreys prior induces penalty log |6| for b. Moreover, it is 
immediately calculated that 

As we can see, our discussions here present an alternative derivation for the adaptive 
lasso (Zou, 2006). Moreover, we also obtain the relationship of the adaptive lasso with 
an EM algorithm. 

Using the EP-Jeffreys prior, we in particular define a hierarchical model: 

[y|b,a]~7V(y|Xb,(7l„), 
[bjlVj,*] ~ d EP(fej|0,ar ?i ,g), 

r i ind i 

IVj] oc n j , 
p(a) = "Constant". 

It is easy to obtain that 

[ Vj \bj,a]^\G{ Vj \l/ q , o--%\V2). 
Given the tth estimates (b^\a^) of (b, a), the E-step of EM calculates 

Q(b, ( r|bW,aW)^logp(y|b, f 7) + ^ / logpfo^, <r]p{r,j\bf , a®)dr,j 

oc -^log<7-^(y-Xb) T (y-Xb) - ^loga 
2 2cr q 



Here we omit some terms that are independent of parameters a and b. Indeed, we only 
need to calculate E(n~ 1 \bf\a^) in the E-step. That is, 



'J 1 J 

2aW 



(t+l) A T-i / -l|.(t) (t)\ 

Wj =E( Vj \b) ,a^)- 

q\b)'\i 

The M-step maximizes Q(b, a\b^\a^) with respect to (b,cr). In particular, it is ob- 
tained as follows: 

v 

b (m) = argmin (y-Xb) T (y-Xb) |Vfflf +1) |i/, 

U 

ff(m) = ^{(y-xb^)Vxb^)) + t^i^i«}. 



29 



Appendix D. The Hierarchy with the Bridge Prior Given in (9) 

Using the bridge prior in (9) yields the following hierarchical model: 

[y|b,a]~iV(y|Xb,<7l n ), 

[bj\rjj,a] ~ L(bj\0,<TT}j), 

M^G( % |3/2,«/2), 
p(a) = "Constant". 

It is easy to obtain that 

[ % -|6 J -, < 7]~GIG(7/ i |l/2, a'^bjla). 
Given the tth estimates (b^,cr^) of (b,cr), the E-step of EM calculates 

V n 

cx -£log<7--L(y-Xb) T (y-Xb) - ^loga 
2 2a q 



j'=i 



Here we omit some terms that are independent of parameters u and b. Indeed, we only 
need to calculate E^J 1 ^^, cr^) in the E-step. That is, 



(t+l) A -ll.(t) (t) 



(t) 



The M-step maximizes Q(b, cr|b^, a^) with respect to (b,cr). In particular, it is ob- 
tained as follows: 

v 

b(' +1 ) = argmin (y-Xb) T (y-Xb) + V wf +1) \bj\ q , 

U 

a(t+1) = ^{(y- Xb(t+1) ) T (y- Xb(t+1) ) + E^ +1) i 6 ? +1) i1- 
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